brier_score_loss (Brier score)#

The Brier score is a proper scoring rule for probabilistic binary classification.
It measures the mean squared error between predicted probabilities and the actual outcomes.

In this notebook you will:

  • understand the definition and its range

  • build intuition with plots

  • implement brier_score_loss from scratch in NumPy

  • use the Brier score as a differentiable training objective for a simple logistic regression model


import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(0)


def sigmoid(z: np.ndarray) -> np.ndarray:
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)

    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))

    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def logit(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def calibrate_logit_scale(p: np.ndarray, k: float, eps: float = 1e-12) -> np.ndarray:
    # Monotone calibration distortion: p -> sigmoid(k * logit(p)).
    return sigmoid(k * logit(p, eps=eps))


def calibration_bins(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    # Uniform-bin calibration curve + counts.

    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()

    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_id = np.digitize(y_prob, bins) - 1
    bin_id = np.clip(bin_id, 0, n_bins - 1)

    mean_pred = np.full(n_bins, np.nan)
    frac_pos = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)

    for b in range(n_bins):
        mask = bin_id == b
        counts[b] = int(mask.sum())
        if counts[b] == 0:
            continue
        mean_pred[b] = float(y_prob[mask].mean())
        frac_pos[b] = float(y_true[mask].mean())

    return bins, mean_pred, frac_pos, counts


def brier_decomposition(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10):
    # Murphy (1973) style decomposition using bins as 'forecast categories'.

    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()

    bs = float(np.mean((y_prob - y_true) ** 2))

    _, mean_pred, frac_pos, counts = calibration_bins(y_true, y_prob, n_bins=n_bins)
    mask = counts > 0

    n = y_true.size
    w = counts[mask] / n
    rel = float(np.sum(w * (mean_pred[mask] - frac_pos[mask]) ** 2))

    y_bar = float(y_true.mean())
    res = float(np.sum(w * (frac_pos[mask] - y_bar) ** 2))

    unc = float(y_bar * (1.0 - y_bar))

    return bs, rel, res, unc

1) Definition#

For binary events we observe labels

\[ y_i \in \{0,1\} \]

and we predict probabilities for the positive class

\[ p_i = \mathbb{P}(y_i = 1 \mid x_i), \qquad p_i \in [0,1]. \]

The Brier score (loss) is

\[ \mathrm{BS}(y,p) = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)^2. \]
  • Lower is better; 0 is perfect.

  • For binary classification with \(p_i \in [0,1]\), the score lies in \([0,1]\).

In scikit-learn, sklearn.metrics.brier_score_loss(y_true, y_prob, ...) expects y_prob = P(y=pos_label) (not hard class predictions).

Multiclass note#

sklearn.metrics.brier_score_loss is binary-only. A common multiclass generalization uses the probability vector \(p_i \in \mathbb{R}^K\) and a one-hot label \(e_{y_i}\):

\[ \mathrm{BS}_K = \frac{1}{n}\sum_{i=1}^n \lVert p_i - e_{y_i} \rVert_2^2 \quad (\text{sometimes divided by }K). \]
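As a quick sketch of that generalization in NumPy (the helper name `brier_score_multiclass` is ours, not part of scikit-learn):

```python
import numpy as np


def brier_score_multiclass(y_true, y_proba):
    # Mean squared L2 distance between the probability vector and the one-hot label.
    # y_true: integer labels in {0, ..., K-1}, shape (n,)
    # y_proba: predicted probabilities, shape (n, K), rows summing to 1
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)
    one_hot = np.eye(y_proba.shape[1])[y_true]
    return float(np.mean(np.sum((y_proba - one_hot) ** 2, axis=1)))


y = np.array([0, 1, 1])
p1 = np.array([0.2, 0.7, 0.9])           # P(y=1)
proba = np.column_stack([1.0 - p1, p1])  # (n, 2) probability matrix
print(brier_score_multiclass(y, proba))
```

For \(K=2\) this is exactly twice the binary Brier score, since \((p-y)^2 + ((1-p)-(1-y))^2 = 2(p-y)^2\); that also explains why \(\mathrm{BS}_K\) lies in \([0,2]\) rather than \([0,1]\).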
p = np.linspace(0.0, 1.0, 501)

fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=(p - 1.0) ** 2, mode="lines", name="y=1: (1-p)^2"))
fig.add_trace(go.Scatter(x=p, y=(p - 0.0) ** 2, mode="lines", name="y=0: p^2"))

fig.update_layout(
    title="Per-example Brier loss as a function of predicted probability",
    xaxis_title="Predicted probability p = P(y=1)",
    yaxis_title="Loss (p - y)^2",
    template="plotly_white",
)
fig.show()

2) Proper scoring rule (why “probabilities” matter)#

Assume the true event probability is \(q\) and

\[ Y \sim \mathrm{Bernoulli}(q). \]

If we always predict the same probability \(p\), the expected Brier score is:

\[ \mathbb{E}[(p - Y)^2] = (p-q)^2 + q(1-q). \]
  • The second term \(q(1-q)\) is the irreducible uncertainty in the labels.

  • The first term \((p-q)^2\) is the penalty for miscalibration.

So the expected Brier score is minimized at:

\[ p^* = q. \]

That’s the defining property of a (strictly) proper scoring rule: telling the truth (predicting the real probability) is optimal.

q = 0.3  # true event probability
n = 50_000

y = rng.binomial(1, q, size=n).astype(float)
p_grid = np.linspace(0.0, 1.0, 401)

empirical = np.mean((p_grid[None, :] - y[:, None]) ** 2, axis=0)
analytic = (p_grid - q) ** 2 + q * (1.0 - q)

fig = go.Figure()
fig.add_trace(go.Scatter(x=p_grid, y=empirical, mode="lines", name="empirical E[(p-Y)^2]"))
fig.add_trace(
    go.Scatter(
        x=p_grid,
        y=analytic,
        mode="lines",
        name="analytic (p-q)^2 + q(1-q)",
        line=dict(dash="dash"),
    )
)
fig.add_vline(x=q, line=dict(color="black", dash="dot"))
fig.update_layout(
    title=f"Expected Brier score is minimized at p = q (here q={q})",
    xaxis_title="Forecast p",
    yaxis_title="Expected Brier score",
    template="plotly_white",
)
fig.show()

3) Best constant predictor + Brier skill score#

If you predict the same probability \(p\) for every sample, the Brier objective is

\[ \min_p \frac{1}{n}\sum_{i=1}^n (p - y_i)^2. \]

Taking the derivative and setting it to zero gives:

\[ p^* = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i. \]

So the best constant forecast is simply the base rate.

A common normalization is the Brier Skill Score (BSS) against a reference forecast (often the base rate):

\[ \mathrm{BSS} = 1 - \frac{\mathrm{BS}_{\text{model}}}{\mathrm{BS}_{\text{ref}}}. \]
  • BSS = 1 is perfect

  • BSS = 0 matches the reference

  • BSS < 0 is worse than the reference

y = rng.binomial(1, 0.2, size=3000).astype(float)
p_star = y.mean()

p_grid = np.linspace(0.0, 1.0, 201)
losses = np.array([np.mean((p - y) ** 2) for p in p_grid])

fig = go.Figure()
fig.add_trace(go.Scatter(x=p_grid, y=losses, mode="lines", name="Brier(p)"))
fig.add_vline(
    x=p_star,
    line=dict(color="black", dash="dot"),
    annotation_text=f"mean(y) = {p_star:.3f}",
)
fig.update_layout(
    title="Best constant probability equals the base rate",
    xaxis_title="Constant prediction p",
    yaxis_title="Brier score",
    template="plotly_white",
)
fig.show()

print(f"Best constant p*: {p_star:.4f}")
print(
    f"Brier score at p*: {np.mean((p_star - y) ** 2):.4f} "
    f"(≈ p*(1-p) = {p_star*(1-p_star):.4f})"
)
Best constant p*: 0.2087
Brier score at p*: 0.1651 (≈ p*(1-p) = 0.1651)

4) Calibration intuition (and the Murphy decomposition)#

Accuracy only uses a hard decision (e.g. p >= 0.5).
The Brier score uses the full probability values, so it can distinguish between:

  • a well-calibrated forecast: “when I say 0.8, it happens ~80% of the time”

  • an overconfident forecast: probabilities pushed too close to 0 or 1

  • an underconfident forecast: probabilities too close to 0.5

For binary forecasts, the Brier score can be decomposed into three interpretable parts (Murphy, 1973):

\[ \mathrm{BS} = \underbrace{\mathrm{Rel}}_{\text{reliability}} - \underbrace{\mathrm{Res}}_{\text{resolution}} + \underbrace{\mathrm{Unc}}_{\text{uncertainty}}. \]

Using probability bins (or exact forecast categories), with:

  • \(p_k\) = average predicted probability in bin \(k\)

  • \(o_k\) = observed frequency in bin \(k\)

  • \(n_k\) = number of samples in bin \(k\)

  • \(\bar{y}\) = base rate

the components are:

\[ \mathrm{Rel} = \sum_k \frac{n_k}{n}(p_k - o_k)^2 \]
\[ \mathrm{Res} = \sum_k \frac{n_k}{n}(o_k - \bar{y})^2 \]
\[ \mathrm{Unc} = \bar{y}(1-\bar{y}). \]

Intuition:

  • Reliability is (binned) calibration error (smaller is better)

  • Resolution rewards separating cases with different true frequencies (bigger is better)

  • Uncertainty is a property of the dataset only

n = 10_000
x = rng.normal(size=n)

# A "true" probability model
p_true = sigmoid(1.5 * x - 0.25)
y = rng.binomial(1, p_true)

# Same hard predictions (threshold at 0.5), but different calibration
p_calibrated = p_true
p_overconf = calibrate_logit_scale(p_true, k=2.0)
p_underconf = calibrate_logit_scale(p_true, k=0.5)

models = {
    "calibrated (truth)": p_calibrated,
    "overconfident (k=2)": p_overconf,
    "underconfident (k=0.5)": p_underconf,
}

print("Model comparison (same threshold decisions, different probability quality)")
for name, p in models.items():
    y_hat = (p >= 0.5).astype(int)
    acc = accuracy_score(y, y_hat)
    bs = brier_score_loss(y, p)
    _, rel, res, unc = brier_decomposition(y, p, n_bins=10)
    approx = rel - res + unc
    print(
        f"{name:22s}  accuracy={acc:.3f}  brier={bs:.4f}  "
        f"rel={rel:.4f}  res={res:.4f}  unc={unc:.4f}  (rel-res+unc≈{approx:.4f})"
    )

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Reliability diagram (calibration curve)", "Predicted probability distribution"),
)

fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        name="perfect calibration",
        line=dict(color="black", dash="dash"),
    ),
    row=1,
    col=1,
)

for name, p in models.items():
    _, mean_pred, frac_pos, counts = calibration_bins(y, p, n_bins=10)
    mask = counts > 0
    fig.add_trace(
        go.Scatter(x=mean_pred[mask], y=frac_pos[mask], mode="markers+lines", name=name),
        row=1,
        col=1,
    )
    fig.add_trace(go.Histogram(x=p, name=name, opacity=0.45, nbinsx=30), row=1, col=2)

fig.update_xaxes(title_text="Mean predicted probability", range=[0, 1], row=1, col=1)
fig.update_yaxes(title_text="Observed frequency", range=[0, 1], row=1, col=1)

fig.update_xaxes(title_text="Predicted probability", range=[0, 1], row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_layout(
    barmode="overlay",
    template="plotly_white",
    title="Calibration affects Brier score even when accuracy is unchanged",
)
fig.show()
Model comparison (same threshold decisions, different probability quality)
calibrated (truth)      accuracy=0.734  brier=0.1763  rel=0.0001  res=0.0703  unc=0.2477  (rel-res+unc≈0.1775)
overconfident (k=2)     accuracy=0.734  brier=0.1879  rel=0.0110  res=0.0691  unc=0.2477  (rel-res+unc≈0.1896)
underconfident (k=0.5)  accuracy=0.734  brier=0.1880  rel=0.0110  res=0.0687  unc=0.2477  (rel-res+unc≈0.1900)

5) From-scratch NumPy implementation#

For binary labels, the implementation is just mean squared error:

\[ \mathrm{BS} = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)^2. \]

Two practical details (matching sklearn.metrics.brier_score_loss):

  1. y_prob should be a probability for the positive class.

  2. If your labels are not {0,1}, you must specify which value is positive (pos_label).

def brier_score_loss_np(y_true, y_prob, sample_weight=None, pos_label=None) -> float:
    y_true = np.asarray(y_true).ravel()
    y_prob = np.asarray(y_prob, dtype=float).ravel()

    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")

    if pos_label is None:
        unique = np.unique(y_true)
        if np.all(np.isin(unique, [0, 1])):
            pos_label = 1
        else:
            raise ValueError("pos_label must be specified when y_true is not in {0,1}")

    y01 = (y_true == pos_label).astype(float)
    y_prob = np.clip(y_prob, 0.0, 1.0)

    sq = (y_prob - y01) ** 2

    if sample_weight is None:
        return float(np.mean(sq))

    w = np.asarray(sample_weight, dtype=float).ravel()
    if w.shape != y_true.shape:
        raise ValueError("sample_weight must have the same shape as y_true")
    if np.any(w < 0):
        raise ValueError("sample_weight cannot contain negative weights")

    return float(np.average(sq, weights=w))


# Quick sanity checks vs scikit-learn

y = rng.integers(0, 2, size=1000)
p = rng.random(1000)

print("sklearn:", brier_score_loss(y, p))
print("numpy  :", brier_score_loss_np(y, p))
print("allclose:", np.allclose(brier_score_loss(y, p), brier_score_loss_np(y, p)))

# pos_label example

y_pm = rng.choice([-1, 1], size=1000)
p = rng.random(1000)
print("pos_label=1 (sklearn):", brier_score_loss(y_pm, p, pos_label=1))
print("pos_label=1 (numpy)  :", brier_score_loss_np(y_pm, p, pos_label=1))

# sample_weight example

w = rng.random(1000)
print("weighted (sklearn):", brier_score_loss(y, p, sample_weight=w))
print("weighted (numpy)  :", brier_score_loss_np(y, p, sample_weight=w))
sklearn: 0.3392506425102862
numpy  : 0.3392506425102862
allclose: True
pos_label=1 (sklearn): 0.3285352441601213
pos_label=1 (numpy)  : 0.3285352441601213
weighted (sklearn): 0.3431820394131602
weighted (numpy)  : 0.3431820394131602

6) Brier vs log loss (cross-entropy)#

Both Brier score and log loss are proper scoring rules, but they penalize mistakes differently.

For a single example:

  • Brier: \((p-y)^2\) (bounded in \([0,1]\))

  • Log loss: \(-y\log p -(1-y)\log(1-p)\) (unbounded for confident wrong predictions)

Log loss heavily punishes being confidently wrong (e.g. predicting \(p \approx 0\) when \(y=1\)), while the Brier score is more forgiving.
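A quick numeric check of that asymmetry, computed directly from the two per-example formulas for a positive example (\(y=1\)) predicted with probability near 0:

```python
import numpy as np

# Penalty for a single confidently wrong prediction (true label y=1, p near 0)
for p in [0.1, 0.01, 0.001]:
    brier = (p - 1.0) ** 2  # saturates at 1
    logloss = -np.log(p)    # grows without bound as p -> 0
    print(f"p={p:<6} brier={brier:.4f}  logloss={logloss:.4f}")
```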

p = np.linspace(1e-3, 1.0 - 1e-3, 500)

brier_y1 = (1.0 - p) ** 2
brier_y0 = p**2

logloss_y1 = -np.log(p)
logloss_y0 = -np.log(1.0 - p)

fig = make_subplots(rows=1, cols=2, subplot_titles=("y=1", "y=0"))

fig.add_trace(go.Scatter(x=p, y=brier_y1, mode="lines", name="Brier"), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=logloss_y1, mode="lines", name="Log loss"), row=1, col=1)

fig.add_trace(go.Scatter(x=p, y=brier_y0, mode="lines", name="Brier"), row=1, col=2)
fig.add_trace(go.Scatter(x=p, y=logloss_y0, mode="lines", name="Log loss"), row=1, col=2)

fig.update_xaxes(title_text="Predicted probability p = P(y=1)", row=1, col=1)
fig.update_xaxes(title_text="Predicted probability p = P(y=1)", row=1, col=2)

fig.update_yaxes(title_text="loss", row=1, col=1)
fig.update_yaxes(title_text="loss", row=1, col=2)

fig.update_layout(title="Per-example loss: Brier vs log loss", template="plotly_white")
fig.show()

7) Optimizing a simple model with the Brier loss (logistic regression)#

Because the Brier score is differentiable in the predicted probabilities, you can use it as a training objective.

Let a logistic model predict

\[ p_i = \sigma(z_i), \qquad z_i = w^\top x_i + b, \]

where \(\sigma(t) = 1/(1+e^{-t})\).

The Brier objective is:

\[ L(w,b) = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)^2. \]

Using the chain rule:

\[ \frac{\partial L}{\partial z_i} = \frac{2}{n}(p_i - y_i)\,p_i(1-p_i). \]

So the gradients are:

\[ \nabla_w L = X^\top \frac{\partial L}{\partial z}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^n \frac{\partial L}{\partial z_i}. \]

A key difference vs log loss: the extra factor \(p(1-p) \le 0.25\) makes gradients smaller near 0/1, which can slow learning on confidently wrong examples.
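To see the damping concretely, compare per-example gradients with respect to the logit \(z\) for a positive example: log loss gives \(p - 1\), while the Brier loss gives \(2(p-1)p(1-p)\). At \(z=-6\) (confidently wrong) the Brier gradient is nearly zero:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Per-example gradient w.r.t. the logit z for a positive example (y=1):
#   log loss: d/dz [-log p]  = p - 1
#   Brier:    d/dz [(p-1)^2] = 2 (p - 1) p (1 - p)
z = np.array([-6.0, -2.0, 0.0])  # confidently wrong -> mildly wrong -> unsure
p = sigmoid(z)

grad_logloss = p - 1.0
grad_brier = 2.0 * (p - 1.0) * p * (1.0 - p)

for zi, gl, gb in zip(z, grad_logloss, grad_brier):
    print(f"z={zi:+.1f}  logloss grad={gl:+.4f}  brier grad={gb:+.4f}")
```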

# Synthetic 2D dataset so we can visualize the learned probabilities
X, y = make_classification(
    n_samples=2000,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    class_sep=1.5,
    flip_y=0.03,
    random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize (important for gradient descent)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma


def train_logreg_brier_gd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 1.0,
    n_iter: int = 1000,
    l2: float = 1e-3,
    seed: int = 0,
):
    rng_local = np.random.default_rng(seed)
    n, d = X.shape

    w = rng_local.normal(scale=0.01, size=d)
    b = 0.0

    history = {"brier": []}

    for _ in range(n_iter):
        z = X @ w + b
        p = sigmoid(z)

        loss = np.mean((p - y) ** 2) + 0.5 * l2 * np.sum(w**2)
        history["brier"].append(float(loss))

        dL_dz = (2.0 / n) * (p - y) * p * (1.0 - p)
        grad_w = X.T @ dL_dz + l2 * w
        grad_b = float(dL_dz.sum())

        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, history


w, b, history = train_logreg_brier_gd(X_train_s, y_train, lr=1.0, n_iter=1000, l2=1e-3, seed=0)

p_test = sigmoid(X_test_s @ w + b)

print("From-scratch logistic regression (trained with Brier loss)")
print(f"test accuracy       : {accuracy_score(y_test, (p_test >= 0.5).astype(int)):.3f}")
print(f"test brier (numpy)  : {brier_score_loss_np(y_test, p_test):.4f}")
print(f"test brier (sklearn): {brier_score_loss(y_test, p_test):.4f}")

fig = go.Figure()
fig.add_trace(go.Scatter(y=history["brier"], mode="lines", name="train Brier loss"))
fig.update_layout(
    title="Gradient descent minimizing Brier loss",
    xaxis_title="Iteration",
    yaxis_title="Brier loss (+ L2 if set)",
    template="plotly_white",
)
fig.show()

# Visualize the learned probability surface (test set)
x0_min, x0_max = X_test_s[:, 0].min() - 0.75, X_test_s[:, 0].max() + 0.75
x1_min, x1_max = X_test_s[:, 1].min() - 0.75, X_test_s[:, 1].max() + 0.75

grid_x0 = np.linspace(x0_min, x0_max, 220)
grid_x1 = np.linspace(x1_min, x1_max, 220)
xx, yy = np.meshgrid(grid_x0, grid_x1)
grid = np.c_[xx.ravel(), yy.ravel()]
proba = sigmoid(grid @ w + b).reshape(xx.shape)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=grid_x0,
        y=grid_x1,
        z=proba,
        contours=dict(start=0.0, end=1.0, size=0.1),
        colorscale="RdBu",
        opacity=0.75,
        colorbar=dict(title="P(y=1)"),
    )
)
fig.add_trace(
    go.Contour(
        x=grid_x0,
        y=grid_x1,
        z=proba,
        contours=dict(start=0.5, end=0.5, coloring="lines"),
        line=dict(color="black", width=2),
        showscale=False,
        name="p=0.5",
    )
)
fig.add_trace(
    go.Scatter(
        x=X_test_s[:, 0],
        y=X_test_s[:, 1],
        mode="markers",
        marker=dict(
            color=y_test,
            colorscale="Portland",
            size=6,
            line=dict(width=0.5, color="white"),
        ),
        name="test points",
    )
)
fig.update_layout(
    title="Learned probability surface and decision boundary (test set)",
    xaxis_title="feature 1 (standardized)",
    yaxis_title="feature 2 (standardized)",
    template="plotly_white",
)
fig.show()

# Calibration curve on the test set
_, mean_pred, frac_pos, counts = calibration_bins(y_test, p_test, n_bins=10)
mask = counts > 0

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        name="perfect calibration",
        line=dict(color="black", dash="dash"),
    )
)
fig.add_trace(go.Scatter(x=mean_pred[mask], y=frac_pos[mask], mode="markers+lines", name="model"))
fig.update_layout(
    title="Reliability diagram (test set)",
    xaxis_title="Mean predicted probability",
    yaxis_title="Observed frequency",
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    template="plotly_white",
)
fig.show()

# Practical usage: the same metric with a standard scikit-learn classifier
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train_s, y_train)
p_sklearn = clf.predict_proba(X_test_s)[:, 1]
print("\nscikit-learn LogisticRegression (log loss optimized)")
print(f"test accuracy: {accuracy_score(y_test, (p_sklearn >= 0.5).astype(int)):.3f}")
print(f"test brier   : {brier_score_loss(y_test, p_sklearn):.4f}")
From-scratch logistic regression (trained with Brier loss)
test accuracy       : 0.980
test brier (numpy)  : 0.0208
test brier (sklearn): 0.0208
scikit-learn LogisticRegression (log loss optimized)
test accuracy: 0.980
test brier   : 0.0200

8) Pros, cons, and when to use it#

Pros

  • Proper scoring rule for probabilistic forecasts (encourages honest probabilities)

  • Interpretable as “MSE on probabilities”; bounded for binary problems

  • Sensitive to calibration; supports the reliability/resolution/uncertainty decomposition

  • Common in risk prediction, weather forecasting, calibration evaluation

Cons

  • Less punishing for confident wrong predictions than log loss (can matter for training)

  • Value depends on the base rate; comparing scores across datasets can be misleading

  • Binary-only in sklearn.metrics.brier_score_loss (multiclass requires an extension)

  • On rare events, raw Brier numbers can look deceptively small; use a skill score or compare to a baseline
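The rare-event caveat is easy to demonstrate: on a ~1%-prevalence dataset, the constant base-rate forecast already achieves a raw Brier score around 0.01, yet its skill score against the base rate is 0 by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1% base rate: the uninformative base-rate forecast has a tiny raw Brier
# score, but zero skill when normalized against itself.
y = rng.binomial(1, 0.01, size=100_000).astype(float)
p_const = np.full_like(y, y.mean())

bs = np.mean((p_const - y) ** 2)
bss = 1.0 - bs / (y.mean() * (1.0 - y.mean()))

print(f"raw Brier = {bs:.5f}, BSS = {bss:.2f}")
```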

9) Pitfalls, diagnostics, exercises#

Pitfalls

  • Passing class labels instead of probabilities (use predict_proba for probabilistic classifiers)

  • Mis-specified positive class (pos_label) when labels are not {0,1}

  • Interpreting Brier without a baseline on imbalanced data

Diagnostics

  • Plot a reliability diagram (calibration curve)

  • Report a baseline Brier score (base rate) and optionally Brier Skill Score

Exercises

  1. Implement the multiclass Brier score and test it on a 3-class toy problem.

  2. Compute Brier Skill Score using the base rate as reference.

  3. Compare Brier vs log loss for a deliberately overconfident model.

References#

  • scikit-learn docs: https://scikit-learn.org/stable/modules/model_evaluation.html#brier-score-loss

  • Glenn W. Brier (1950), Verification of forecasts expressed in terms of probability

  • Allan H. Murphy (1973), A new vector partition of the probability score